naive bayes classifier


Automated Bug Report Prioritization in Large Open-Source Projects

arXiv.org Artificial Intelligence

Large open-source projects receive a large number of issues (known as bugs), including software defect (i.e., bug) reports and new feature requests from their user and developer communities at a fast rate. The often limited project resources do not allow them to deal with all issues. Instead, they have to prioritize them according to the project's priorities and the issues' severities. In this paper, we propose a novel approach to automated bug prioritization based on the natural language text of the bug reports that are stored in the open bug repositories of the issue-tracking systems. We conduct topic modeling using a variant of LDA called TopicMiner-MTM and text classification with the BERT large language model to achieve a higher performance level compared to the state-of-the-art. Experimental results using an existing reference dataset containing 85,156 bug reports of the Eclipse Platform project indicate that we outperform existing approaches in terms of Accuracy, Precision, Recall, and F1-measure of the bug report priority prediction.

Index Terms: automated bug prioritization, automated bug triage, mining software repositories, machine learning, natural language processing

I. INTRODUCTION

Large open-source projects offer an issue-tracking system with an open bug repository, where developers and users can report the software defects they find or any new feature requests they may have. These reports are called bug reports. However, the projects' resources are limited, while processing and resolving the bug reports is typically very costly. Hence, not all bug reports in the open bug repository can be processed and handled at once.


Fractional Naive Bayes (FNB): non-convex optimization for a parsimonious weighted selective naive Bayes classifier

arXiv.org Machine Learning

We study supervised classification for datasets with a very large number of input variables. The naïve Bayes classifier is attractive for its simplicity, scalability and effectiveness in many real data applications. When the strong naïve Bayes assumption of conditional independence of the input variables given the target variable is not valid, variable selection and model averaging are two common ways to improve the performance. In the case of the naïve Bayes classifier, the resulting weighting scheme on the models reduces to a weighting scheme on the variables. Here we focus on direct estimation of variable weights in such a weighted naïve Bayes classifier. We propose a sparse regularization of the model log-likelihood, which takes into account prior penalization costs related to each input variable. Compared to the averaging-based classifiers used up until now, our main goal is to obtain parsimonious robust models with fewer variables and equivalent performance. The direct estimation of the variable weights amounts to a non-convex optimization problem for which we propose and compare several two-stage algorithms. First, the criterion obtained by convex relaxation is minimized using several variants of standard gradient methods. Then, the initial non-convex optimization problem is solved using local optimization methods initialized with the result of the first stage. The various proposed algorithms result in optimization-based weighted naïve Bayes classifiers that are evaluated on benchmark datasets and positioned w.r.t. a reference averaging-based classifier.
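To make the idea of a weighting scheme on the variables concrete, here is a minimal sketch of how a weighted naïve Bayes score is commonly written: each variable's log-likelihood is multiplied by a weight before the terms are summed. It only shows the scoring step, not the paper's regularized estimation of the weights, and the function names, table layout and numbers are assumptions made up for illustration.

```python
import math

def weighted_nb_log_scores(x, log_prior, log_cond, weights):
    """Unnormalised log-posterior per class for a weighted naive Bayes model.

    log_prior[c]       = log P(Y = c)
    log_cond[c][j][v]  = log P(X_j = v | Y = c)
    weights[j]         = weight of variable j (1 = plain naive Bayes, 0 = dropped)
    """
    scores = {}
    for c, lp in log_prior.items():
        s = lp
        for j, v in enumerate(x):
            s += weights[j] * log_cond[c][j][v]
        scores[c] = s
    return scores

def normalise(scores):
    """Turn log scores into posterior probabilities (softmax, shifted for stability)."""
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

# Hypothetical two-class model with two binary variables.
log_prior = {"a": math.log(0.5), "b": math.log(0.5)}
log_cond = {
    "a": [{0: math.log(0.2), 1: math.log(0.8)}, {0: math.log(0.7), 1: math.log(0.3)}],
    "b": [{0: math.log(0.8), 1: math.log(0.2)}, {0: math.log(0.4), 1: math.log(0.6)}],
}
x = (1, 1)

plain = normalise(weighted_nb_log_scores(x, log_prior, log_cond, [1.0, 1.0]))
sparse = normalise(weighted_nb_log_scores(x, log_prior, log_cond, [1.0, 0.0]))
print(plain["a"], sparse["a"])  # about 0.667 and 0.8
```

With all weights equal to 1 the score is the ordinary naïve Bayes posterior. Setting a weight to 0 removes that variable from the model, which is why a sparse weight vector gives a parsimonious classifier.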


Naive Bayes Classifiers and One-hot Encoding of Categorical Variables

arXiv.org Machine Learning

This paper investigates the consequences of encoding a $K$-valued categorical variable incorrectly as $K$ bits via one-hot encoding, when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label for most cases, although the posterior probabilities are usually greater for the PoB case.
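The difference between the two likelihoods can be sketched for a single $K$-valued variable. The snippet below assumes that each one-hot bit is modelled as a Bernoulli variable with the same per-class probability as the corresponding category. The probabilities are invented for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def categorical_posterior(prior, theta, k):
    """Posterior over classes when value k is scored with the categorical likelihood.

    theta[c, i] = P(X = i | Y = c); each row sums to 1.
    """
    joint = prior * theta[:, k]
    return joint / joint.sum()

def pob_posterior(prior, theta, k):
    """Posterior when the one-hot bits are treated as independent Bernoullis.

    The bit for value k is 1 (factor theta[c, k]); every other bit i is 0
    (factor 1 - theta[c, i]).
    """
    off_bits = np.delete(1.0 - theta, k, axis=1)
    lik = theta[:, k] * off_bits.prod(axis=1)
    joint = prior * lik
    return joint / joint.sum()

# Hypothetical two-class problem with a 3-valued variable.
prior = np.array([0.5, 0.5])
theta = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.3, 0.5]])

cat = categorical_posterior(prior, theta, k=0)
pob = pob_posterior(prior, theta, k=0)
print(cat)  # class 0 gets about 0.78
print(pob)  # class 0 gets about 0.88
```

In this toy case both models pick the same class, but the PoB posterior is more extreme because the "off" bits contribute extra factors that the categorical model does not have.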


The Impact of Twitter Sentiments on Stock Market Trends

arXiv.org Artificial Intelligence

The Web is a vast virtual space where people can share their opinions, impacting all aspects of life and having implications for marketing and communication. The most up-to-date and comprehensive information can be found on social media because of how widespread and straightforward it is to post a message. Proportionately, they are regarded as a valuable resource for making precise market predictions. In particular, Twitter has developed into a potent tool for understanding user sentiment. This article examines how well tweets can influence stock symbol trends. We analyze the volume, sentiment, and mentions of the top five stock symbols in the S&P 500 index on Twitter over three months. Long Short-Term Memory, Bernoulli Naïve Bayes, and Random Forest were the three algorithms implemented in this process. Our study revealed a significant correlation between stock prices and Twitter sentiment.


The Naive Bayes classifier: How it works

#artificialintelligence

Classification algorithms try to predict the class or label of a categorical target variable. A categorical variable typically represents qualitative data that has discrete values, such as pass/fail or low/medium/high. Of the many classification algorithms, the Naïve Bayes classifier is one of the simplest. It is often used with large text datasets, among other applications. The aim of this article is to explain how the Naïve Bayes algorithm works.


Estimating IRI based on pavement distress type, density, and severity: Insights from machine learning techniques

arXiv.org Machine Learning

Surface roughness is a primary measure of pavement performance that has been associated with ride quality and vehicle operating costs. Of all the surface roughness indicators, the International Roughness Index (IRI) is the most widely used. However, it is costly to measure IRI, and for this reason, certain road classes are excluded from IRI measurements at a network level. Higher levels of distresses are generally associated with higher roughness. However, for a given roughness level, pavement data typically exhibits a great deal of variability in the distress types, density, and severity. It is hypothesized that it is feasible to estimate the IRI of a pavement section given its distress types and their respective densities and severities. To investigate this hypothesis, this paper uses data from in-service pavements and machine learning methods to ascertain the extent to which IRI can be predicted given a set of pavement attributes. The results suggest that machine learning can be used reliably to estimate IRI based on the measured distress types and their respective densities and severities. The analysis also showed that IRI estimated this way depends on the pavement type and functional class. The paper also includes an exploratory section that addresses the reverse situation, that is, estimating the probability of pavement distress type distribution and occurrence severity/extent based on a given roughness level.


NAÏVE Bayes Classifier

#artificialintelligence

Let us talk about Bayesian networks. A Bayesian network is a probabilistic model that represents a set of random variables and their conditional dependencies. The model can be represented as a DAG (directed acyclic graph) whose nodes can be observable quantities, latent variables (not observable, only inferred), unknown parameters, or hypotheses. A DAG makes the model easier to understand. Edges in the DAG represent conditional dependencies between nodes.


A Model-Agnostic Algorithm for Bayes Error Determination in Binary Classification

arXiv.org Artificial Intelligence

This paper presents the intrinsic limit determination algorithm (ILD Algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features regardless of the model used. This limit, namely the Bayes error, is completely independent of any model used and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm when applied to the considered dataset. In this paper the algorithm is described in detail, its entire mathematical framework is presented and the pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.
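The idea of a model-independent limit can be illustrated with a short sketch. This is not the paper's ILD algorithm, only a simple counting argument under the assumption that all features are categorical: records with identical feature values must receive the same prediction from any deterministic classifier, so the best achievable accuracy on the dataset comes from predicting the majority label in each group. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def empirical_best_accuracy(rows, labels):
    """Highest accuracy any deterministic classifier can reach on this dataset.

    rows   : iterable of feature tuples (categorical values)
    labels : iterable of class labels, aligned with rows
    """
    counts = defaultdict(Counter)
    total = 0
    for features, y in zip(rows, labels):
        counts[tuple(features)][y] += 1
        total += 1
    # In each group of identical feature vectors, at most the majority
    # label count can be classified correctly.
    correct = sum(max(c.values()) for c in counts.values())
    return correct / total

# Toy data: three records share ("a", 0) but disagree on the label.
rows = [("a", 0), ("a", 0), ("a", 0), ("b", 1), ("b", 1)]
labels = [1, 1, 0, 0, 0]
print(empirical_best_accuracy(rows, labels))  # 0.8
```

The value is measured on the same records that define the groups, so it is an optimistic figure for the dataset at hand rather than an estimate of performance on new data.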


Machine learning made easy with Python

#artificialintelligence

Naïve Bayes is a classification technique that serves as the basis for implementing several classifier modeling algorithms. Naïve Bayes-based classifiers are considered some of the simplest, fastest, and easiest-to-use machine learning techniques, yet are still effective for real-world applications. Naïve Bayes is based on Bayes' theorem, formulated by 18th-century statistician Thomas Bayes. This theorem assesses the probability that an event will occur based on conditions related to the event. For example, an individual with Parkinson's disease typically has voice variations; hence such symptoms are considered related to the prediction of a Parkinson's diagnosis.
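As a rough illustration of Bayes' theorem for a single symptom, the sketch below computes P(condition | symptom) from a prior and two conditional probabilities. The function name and all numbers are invented for the example and are not medical statistics.

```python
def bayes_posterior(prior, p_symptom_given_condition, p_symptom_given_no_condition):
    """P(condition | symptom) via Bayes' theorem.

    prior                          = P(condition)
    p_symptom_given_condition      = P(symptom | condition)
    p_symptom_given_no_condition   = P(symptom | no condition)
    """
    # Total probability of observing the symptom, with or without the condition.
    evidence = (p_symptom_given_condition * prior
                + p_symptom_given_no_condition * (1.0 - prior))
    return p_symptom_given_condition * prior / evidence

# Hypothetical numbers: 1% prevalence, symptom present in 90% of cases
# and in 10% of people without the condition.
p = bayes_posterior(prior=0.01,
                    p_symptom_given_condition=0.9,
                    p_symptom_given_no_condition=0.1)
print(round(p, 4))  # 0.0833
```

Even with a symptom that is common among cases, the posterior stays low when the condition itself is rare, which is why the prior matters as much as the likelihood.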


How machine learning removes spam from your inbox

#artificialintelligence

This article is part of "Deconstructing artificial intelligence," a series of posts that explore the details of how AI applications work. Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge task of filtering out the spam and making sure their users receive the messages that matter. The line between spam and non-spam messages is fuzzy, and the criteria change over time. From various efforts to automate spam detection, machine learning has so far proven to be the most effective and the favored approach by email providers.